
Abstract. Discovering pattern sets or global patterns is an attractive
issue from the pattern mining community in order to provide useful
information. By combining local patterns satisfying a joint meaning, this
approach produces patterns of higher level and thus more useful for the data
analyst than the usual local patterns, while reducing the number of
patterns. In parallel, recent works investigating relationships between data
mining and constraint programming (CP) show that the CP paradigm
is a nice framework to model and mine such patterns in a declarative
and generic way. We present a constraint-based language which enables
us to define queries addressing pattern sets and global patterns. The
usefulness of such a declarative approach is highlighted by several
examples coming from the clustering based on associations. This language has
been implemented in the CP framework.



1 Introduction



Over the last two decades, local pattern discovery has become a rapidly growing
field [TB] and several paradigms are available for producing extensive collections 
of patterns such as the constraint-based pattern mining [17], condensed rep- 
resentations of patterns [3], interestingness measures [7] as well as integrating 
external resources and background knowledge [TS] . Because of the exhaustive na- 
ture of the techniques, the pattern collections provide a fairly complete picture 
of the information content of the data. However, this approach suffers from two
limitations. First, the collections of patterns remain too large for an individual
and global analysis performed by the data analyst. Second, the so-called local
patterns represent fragmented information, and the patterns expected by the data
analyst require considering several local patterns simultaneously. In this work,
we propose a declarative approach addressing the issue of discovering patterns
combining several local patterns.

The data mining literature includes many methods to take into account the 
relationships between patterns and produce global patterns or pattern sets [418] . 
Recent approaches - constraint-based pattern set mining [4], pattern teams [14]
and selecting patterns according to the added value of a new pattern given 
the currently selected patterns [2] - aim at reducing the redundancy by selecting 
patterns from the initial large set of local patterns on the basis of their usefulness 
in the context of the other selected patterns. Even if these approaches explicitly 
compare patterns, they are mainly based on the reduction of the redundancy 
or specific aims such as classification processes. Heuristic functions are often 
used and the lack of methods to mine complete and correct pattern sets or 
global patterns may be explained by the difficulty of the task. Mining local 
patterns under constraints requires the exploration of a large search space but 
mining global patterns under constraints is even harder because we have to take 
into account and compare the solutions satisfying each pattern involved in the 
constraints. The lack of generic approaches restrains the discovery of useful global
patterns because the user has to develop a new method each time he or she wants to
extract a new kind of global pattern. This explains why this issue deserves our
attention.

In this paper, we propose a constraint-based language to discover patterns 
combining several local patterns. The data analyst expresses his/her queries 
thanks to constraints over terms built from constants, variables, operators, and 
function symbols. The key idea is to propose a generic and declarative approach 
to ask queries: the user models a problem by specifying a set of constraints and 
then a Constraint Programming (CP) system is responsible for solving it. This 
work is in the spirit of the cross-fertilization between data mining and CP which 
is an emerging research field [10,11,12,13,18,19].

The constraint-based language has the great advantage of providing an easy
method to address different problems: it is enough to change the declarative
specification in terms of constraints. We illustrate the approach by several examples
coming from the clustering based on associations: with simple query refinements, 
the data analyst is able to easily produce clusterings satisfying different proper- 
ties. We think that the process greatly facilitates the building of global patterns 
and the discovery of knowledge. We do not detail the solving step in this paper;
a preliminary implementation of the constraint-based language is given in [12].

This paper is organized as follows. Section 2 describes the constraint-based
language and shows how queries and constraints can be defined using terms and
built-in constraints. Starting from the clustering example, Section 3 depicts the
process of successive refinements which enables us to easily address several kinds
of clustering and then the discovery of global models.
Table 1. Transactional dataset T.



2 A Constraint-based Language 

In this section, we describe the constraint-based language we propose. Terms are 
built using constants, variables, operators, and function symbols. Constraints are 
relations over terms that can be satisfied or not. First, we recall definitions. Then, 
we describe terms and present how the data analyst can define new function 
symbols using operators and built-in function symbols. Finally, we introduce 
constraints and show how queries and constraints can be defined using terms 
and built-in constraints. 



2.1 Definitions and example 

Let $\mathcal{I}$ be a set of $n$ distinct literals called items. An itemset (or pattern) is a
non-empty subset of $\mathcal{I}$. The language of itemsets corresponds to $\mathcal{L}_{\mathcal{I}} = 2^{\mathcal{I}} \setminus \emptyset$. A
transactional dataset is a multi-set of $m$ itemsets of $\mathcal{L}_{\mathcal{I}}$. Each itemset, usually
called a transaction or object, is a database entry. For instance, Table 1 gives
a transactional dataset $\mathcal{T}$ where $m = 11$ transactions $t_1, \ldots, t_{11}$ are described by
$n = 8$ items A, B, C, D, E, F, G, H.

Definition 1. (frequency) The frequency of a pattern is the number of transactions
it covers. Let $X_i$ be a pattern; $\mathrm{freq}(X_i) = |\{t \in \mathcal{T} \mid X_i \subseteq t\}|$.

So, $\mathrm{freq}(\{A, E\}) = 3$ and $\mathrm{freq}(\{C, F, G, H\}) = 1$. The frequency constraint
focuses on patterns occurring in the dataset a number of times exceeding a given
minimal threshold: $\mathrm{freq}(X_i) > \mathit{minfr}$. Another interesting measure to evaluate
the relevance of patterns is the area [6].

Definition 2. (area) Let $X_i$ be a pattern; $\mathrm{area}(X_i) = \mathrm{freq}(X_i) \times \mathrm{size}(X_i)$,
where $\mathrm{size}(X_i)$ denotes the cardinality of $X_i$.

For transactional dataset $\mathcal{T}$ (see Table 1), there are nine patterns satisfying
the constraint $\mathrm{area}(X) > 6$: {A,E,G}, {B,E,G}, {C,E,G}, {C,E,H},
{E,G}, {C,E}, {C,H}, {E}, {G}.
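Definitions 1 and 2 can be illustrated with a minimal Python sketch. The dataset below is a hypothetical toy example, not Table 1, and the function names, the threshold of 6 and the use of `>=` are assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical toy dataset (NOT Table 1 of the paper): 7 transactions, 10 items.
T = [frozenset(t) for t in ("ABK", "ABM", "CDGM", "CDGM", "EFG", "EFH", "EFH")]
ITEMS = sorted(set().union(*T))

def freq(pattern):
    """Number of transactions containing every item of the pattern (Definition 1)."""
    return sum(1 for t in T if pattern <= t)

def area(pattern):
    """area(X) = freq(X) * size(X) (Definition 2)."""
    return freq(pattern) * len(pattern)

def patterns_with_min_area(min_area):
    """Enumerate all non-empty itemsets and keep those whose area is >= min_area."""
    result = []
    for size in range(1, len(ITEMS) + 1):
        for items in combinations(ITEMS, size):
            if area(frozenset(items)) >= min_area:
                result.append("".join(items))
    return result

print(freq(frozenset("AB")))      # frequency of {A, B}
print(area(frozenset("CDGM")))    # area of {C, D, G, M}
print(patterns_with_min_area(6))  # all patterns reaching the area threshold
```

The exhaustive enumeration is only practical for very small item sets; it is meant to make the definitions concrete, not to suggest a mining algorithm.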



2.2 Terms 

Terms are built using:

1. constants, which are either numerical values (such as the threshold minfr), items (such as A),
patterns (such as {A,B}) or transactions (such as $t_7$);

2. variables, denoted $X_i$ for $1 \leq i \leq k$, which represent the unknown patterns;

3. operators:

- set operators such as $\cap$, $\cup$, $\setminus$, ...

- numerical operators such as $+$, $-$, $\times$, $/$, ...

4. function symbols involving one or several patterns: freq/1, size/1, cover/1,
overlapItems/2, overlapTransactions/2, ...

Terms are built using constants, variables, operators, and function symbols. 
Examples of terms: 

- $\mathrm{freq}(X_1) \times \mathrm{size}(X_1)$

- $\mathrm{freq}(X_1 \cup X_2) \times \mathrm{size}(X_1 \cap X_2)$

- $\mathrm{freq}(X_1) - \mathrm{freq}(X_2)$



i) Built-in function symbols. Our constraint-based language provides predefined
(built-in) function symbols^1 such as:

- $\mathrm{cover}(X_i) = \{t \mid t \in \mathcal{T}, X_i \subseteq t\}$ is the set of transactions covered by $X_i$.

- $\mathrm{freq}(X_i) = |\{t \mid t \in \mathcal{T}, X_i \subseteq t\}|$

- $\mathrm{size}(X_i) = |\{j \mid j \in \mathcal{I}, j \in X_i\}|$

- $\mathrm{overlapItems}(X_i, X_j) = |X_i \cap X_j|$ is the number of items shared by both
$X_i$ and $X_j$.

- $\mathrm{overlapTransactions}(X_i, X_j) = |\mathrm{cover}(X_i) \cap \mathrm{cover}(X_j)|$ is the number
of transactions covered by both $X_i$ and $X_j$.
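The built-in function symbols listed above translate almost literally into set operations. The following sketch assumes a hypothetical toy dataset (not Table 1) and Python-style names such as `overlap_items`; transactions are identified by their 0-based position in the list.

```python
# Hypothetical toy dataset (NOT Table 1 of the paper).
T = [frozenset(t) for t in ("ABK", "ABM", "CDGM", "CDGM", "EFG", "EFH", "EFH")]

def cover(x):
    """Indices (0-based) of the transactions that contain pattern x."""
    return frozenset(i for i, t in enumerate(T) if x <= t)

def freq(x):
    """Number of transactions covered by x."""
    return len(cover(x))

def size(x):
    """Number of items of x."""
    return len(x)

def overlap_items(xi, xj):
    """Number of items shared by xi and xj."""
    return len(xi & xj)

def overlap_transactions(xi, xj):
    """Number of transactions covered by both xi and xj."""
    return len(cover(xi) & cover(xj))

G, M = frozenset("G"), frozenset("M")
print(sorted(cover(G)))
print(overlap_items(frozenset("CDGM"), frozenset("EFG")))
print(overlap_transactions(G, M))
```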



ii) User-defined function symbols. The data analyst can define new function
symbols using constants, variables, operators and existing function symbols
(built-in or previously defined ones). Examples:

- $\mathrm{area}(X_i) = \mathrm{freq}(X_i) \times \mathrm{size}(X_i)$

- $\mathrm{coverage}(X_i, X_j) = \mathrm{freq}(X_i \cup X_j) \times \mathrm{size}(X_i \cap X_j)$

- Let $D_1, D_2 \subset \mathcal{T}$ be two sets of transactions and $\mathrm{freq}(X_i, D_j)$ the frequency of
pattern $X_i$ in $D_j$; then:

$$\text{growth-rate}(X_i) = \frac{|D_2| \times \mathrm{freq}(X_i, D_1)}{|D_1| \times \mathrm{freq}(X_i, D_2)}$$
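As an illustration, user-defined function symbols can be written as small functions composed from the built-in ones. The sketch below uses a hypothetical toy dataset and an arbitrary split into $D_1$ and $D_2$; returning infinity when the denominator is zero is an assumed convention, not something stated in the paper.

```python
import math

# Hypothetical toy dataset (NOT Table 1 of the paper) and an arbitrary split.
T = [frozenset(t) for t in ("ABK", "ABM", "CDGM", "CDGM", "EFG", "EFH", "EFH")]
D1, D2 = T[:4], T[4:]

def freq(x, dataset=T):
    """Frequency of pattern x in the given set of transactions."""
    return sum(1 for t in dataset if x <= t)

def area(x):
    return freq(x) * len(x)

def coverage(xi, xj):
    """coverage(Xi, Xj) = freq(Xi U Xj) * size(Xi n Xj)."""
    return freq(xi | xj) * len(xi & xj)

def growth_rate(x):
    """(|D2| * freq(x, D1)) / (|D1| * freq(x, D2))."""
    denominator = len(D1) * freq(x, D2)
    if denominator == 0:
        return math.inf  # assumption: convention when x never occurs in D2
    return (len(D2) * freq(x, D1)) / denominator

print(coverage(frozenset("CG"), frozenset("GM")))
print(growth_rate(frozenset("G")))
```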



^1 Only function symbols used in Section 3 are introduced in this paper.



2.3 Constraints and Queries 

Constraints are relations over terms. They can be either built-in or user-defined.
There are three kinds of constraints:

1. numerical ones like: $<$, $\leq$, $=$, $\neq$, $\geq$, $>$, ...
Examples:

- $\mathrm{freq}(X_1) < 10$

- $\mathrm{size}(X_2) = 2 \times \mathrm{size}(X_3)$

- $\mathrm{area}(X_1) < \mathrm{size}(X_2) \times \mathrm{size}(X_3)$

2. set ones like: $=$, $\neq$, $\in$, $\notin$, $\subset$, $\subseteq$, ...
Examples:

- iaeXi 

- $X_1 \cup X_2 \subset X_3$

- $X_1 = X_2 \cap X_3$

3. dedicated ones like:

- $\mathrm{closed}(X_i)$ is satisfied iff $X_i$ is a closed^2 pattern.

- $\mathrm{coverTransactions}([X_1, \ldots, X_k])$ is satisfied iff each transaction is covered
by at least one pattern (i.e. $\bigcup_{1 \leq i \leq k} \mathrm{cover}(X_i) = \mathcal{T}$).

- $\mathrm{coverItems}([X_1, \ldots, X_k])$ is satisfied iff every item belongs to at least one
pattern (i.e. $\bigcup_{1 \leq i \leq k} X_i = \mathcal{I}$).

- $\mathrm{canonical}([X_1, \ldots, X_k])$ is satisfied iff for all $i$ s.t. $1 \leq i < k$, pattern $X_i$
is less than pattern $X_{i+1}$ with respect to the lexicographic order.
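The dedicated constraints can be read as Boolean predicates over one pattern or over a list of patterns. The sketch below is an illustrative check on a hypothetical toy dataset, not the filtering performed by a CP solver. It makes two assumptions: a pattern covering no transaction is not considered closed, and the lexicographic order compares the sorted item lists of the patterns.

```python
# Hypothetical toy dataset (NOT Table 1 of the paper).
T = [frozenset(t) for t in ("ABK", "ABM", "CDGM", "CDGM", "EFG", "EFH", "EFH")]
ITEMS = frozenset().union(*T)

def cover(x):
    return frozenset(i for i, t in enumerate(T) if x <= t)

def closed(x):
    """x is closed iff it is the largest pattern shared by the transactions it covers.
    Assumption: patterns with an empty cover are rejected."""
    covered = [T[i] for i in cover(x)]
    return bool(covered) and x == frozenset.intersection(*covered)

def cover_transactions(xs):
    """Every transaction is covered by at least one pattern of xs."""
    return frozenset().union(*(cover(x) for x in xs)) == frozenset(range(len(T)))

def cover_items(xs):
    """Every item belongs to at least one pattern of xs."""
    return frozenset().union(*xs) == ITEMS

def canonical(xs):
    """Patterns are strictly increasing (assumed order: sorted item lists)."""
    keys = [sorted(x) for x in xs]
    return all(a < b for a, b in zip(keys, keys[1:]))

xs = [frozenset("AB"), frozenset("CDGM"), frozenset("EF")]
print(closed(frozenset("AB")), closed(frozenset("A")))
print(cover_transactions(xs), cover_items(xs), canonical(xs))
```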

Queries and constraints are formulae built using constraints and logical
connectors: $\wedge$ (conjunction) and $\vee$ (disjunction).

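In the following, we take the exception rules as an example^3. An exception rule^4
is a pattern combining a strong rule and a deviational pattern to the strong rule:

$$\left\{ \begin{array}{ll} \text{true} & \text{if } \exists X_2 \in \mathcal{L}_{\mathcal{I}} \text{ such that } X_2 \subset X_1 \text{, one has } (X_1 \setminus X_2 \rightarrow I) \wedge (X_1 \rightarrow \neg I) \\ \text{false} & \text{otherwise} \end{array} \right.$$

Adding $X_2$ to $X_1 \setminus X_2$ provides the exception rule $X_1 \rightarrow \neg I$:

- $X_1 \setminus X_2 \rightarrow I$ must be a frequent rule having a high confidence value;

- $X_1 \rightarrow \neg I$ must be a rare rule having a high confidence value.

To sum up:

$$\left\{ \begin{array}{l} \mathrm{freq}((X_1 \setminus X_2) \cup I) > \mathit{minfr} \;\wedge \\ (\mathrm{freq}(X_1 \setminus X_2) - \mathrm{freq}((X_1 \setminus X_2) \cup I)) < \delta_1 \;\wedge \\ \mathrm{freq}(X_1 \cup \neg I) < \mathit{maxfr} \;\wedge \\ (\mathrm{freq}(X_1) - \mathrm{freq}(X_1 \cup \neg I)) < \delta_2 \end{array} \right.$$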



^2 Let $\mathcal{T}_{X_i}$ be the set of transactions covered by pattern $X_i$. $X_i$ is closed iff $X_i$ is the
largest ($\subset$) pattern covering $\mathcal{T}_{X_i}$.
^3 For more examples, see the modelling of the clustering problem (Section 3).
^4 The definition of exception rules initially presented in [20] also includes a reference
rule $X_2 \nrightarrow \neg I$.



3 From Modelling to Solving 

The major strength of our approach is to provide a simple and efficient way to
refine a query. In practice, the data analyst begins by submitting a first query
$Q_0$. Then, he or she successively refines this query (deriving $Q_{i+1}$ from $Q_i$) until
he or she considers that relevant information has been extracted.

Clustering models aim at partitioning data into groups (clusters) so that 
transactions occurring in the same cluster are similar but different from those 
appearing in other clusters. We selected the clustering problem to illustrate our
approach for two main reasons. First, clustering is an important and popular
unsupervised learning method [1,5,9]. Second, by nature, clustering proceeds by
iteratively refining queries until a satisfactory solution is found. The clustering
model used here starts from closed patterns because a closed pattern
gathers the maximum amount of similarity between a set of transactions.

3.1 Modelling a clustering query 

The usual clustering problem can be defined as follows:

"to find a set of $k$ closed patterns $X_1, X_2, \ldots, X_k$ covering all transactions
without any overlap on these transactions".

First, $\mathrm{closed}(X_i)$ constraints (see Section 2.3) are used to enforce each unknown
pattern $X_i$ to be closed.

Then, it is easy to constrain the set of patterns to cover the whole transactional
dataset using the $\mathrm{coverTransactions}([X_1, X_2, \ldots, X_k])$ constraint (see Section 2.3).

Finally, to avoid any overlap over the transactions, for each pair of patterns
$(X_i, X_j)$, $i < j$, a constraint $\mathrm{overlapTransactions}(X_i, X_j) = 0$ is added. This
constraint states that there is no transaction covered by both $X_i$ and $X_j$.

The following query ($Q_0$) models the initial clustering problem:

$$\left\{ \begin{array}{l} \bigwedge_{1 \leq i \leq k} \mathrm{closed}(X_i) \;\wedge \\ \mathrm{coverTransactions}([X_1, \ldots, X_k]) \;\wedge \\ \bigwedge_{1 \leq i < j \leq k} \mathrm{overlapTransactions}(X_i, X_j) = 0 \end{array} \right.$$

On our running example, when looking for a clustering with $k = 3$ patterns,
we obtain 30 solutions (see Table 2).
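To make the semantics of $Q_0$ concrete, the sketch below enumerates its solutions by naive generate-and-test. This is not the CP-based solving step of [12]. The dataset is a hypothetical toy example (not Table 1), so the counts differ from the 30 solutions above, and `solve` and `q0` are illustrative names. The closed constraint is enforced by restricting each variable's domain to the closed patterns that cover at least one transaction.

```python
from itertools import combinations, product

# Hypothetical toy dataset (NOT Table 1 of the paper).
T = [frozenset(t) for t in ("ABK", "ABM", "CDGM", "CDGM", "EFG", "EFH", "EFH")]
ITEMS = sorted(set().union(*T))
ALL_TRANSACTIONS = frozenset(range(len(T)))

def cover(x):
    return frozenset(i for i, t in enumerate(T) if x <= t)

def closed(x):
    covered = [T[i] for i in cover(x)]
    return bool(covered) and x == frozenset.intersection(*covered)

# Domain of every variable X_i: closed(X_i) then holds by construction.
CLOSED = [frozenset(c) for r in range(1, len(ITEMS) + 1)
          for c in combinations(ITEMS, r) if closed(frozenset(c))]

def q0(xs):
    """coverTransactions(xs) and overlapTransactions(Xi, Xj) = 0 for all i < j."""
    covers = [cover(x) for x in xs]
    covers_all = frozenset().union(*covers) == ALL_TRANSACTIONS
    no_overlap = all(len(a & b) == 0 for a, b in combinations(covers, 2))
    return covers_all and no_overlap

def solve(query, k):
    """Naive generate-and-test over all k-tuples of closed patterns."""
    return [xs for xs in product(CLOSED, repeat=k) if query(xs)]

solutions = solve(q0, 3)
clusterings = {frozenset(s) for s in solutions}
print(len(solutions), "ordered solutions,", len(clusterings), "distinct clusterings")
```

On this toy dataset the ordered solutions are permutations of a few clusterings, which is the redundancy addressed by the first refinement.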

3.2 Refining queries 

By only refining queries addressing a clustering, the data analyst can easily produce
clusterings satisfying different properties. In this section, we illustrate this
approach by successive refinements. Starting from the initial query $Q_0$, symmetrical
solutions are first removed, leading to query $Q_1$. Then, clusterings with non-frequent
patterns and clusterings with small size patterns are removed (leading
to queries $Q_2$ and $Q_3$). More generally, this process greatly facilitates the building
of global patterns and the discovery of knowledge.
Table 2. Set of all solutions (including symmetrical ones).



i) Removing symmetrical solutions. Two solutions $s_i$ and $s_j$ are said to be
symmetrical iff there exists a permutation $\sigma$ such that $s_j = \sigma(s_i)$. A clustering
problem intrinsically has many symmetrical solutions: let $s = (p_1, p_2, \ldots, p_k)$
be a solution containing $k$ patterns $p_i$. Any permutation $\sigma$ of these $k$ patterns,
$\sigma(s) = (p_{\sigma(1)}, p_{\sigma(2)}, \ldots, p_{\sigma(k)})$, is also a solution. So, for any solution, there exist
$(k! - 1)$ symmetrical solutions. For example, solutions $s_1$ to $s_6$ are symmetrical
(see Table 2) and constitute the same clustering.

Constraint $\mathrm{canonical}([X_1, \ldots, X_k])$ is used to avoid symmetrical solutions.
This constraint states that, for all $i$ s.t. $1 \leq i < k$, pattern $X_i$ is less than
pattern $X_{i+1}$ with respect to the lexicographic order.

From query $Q_0$, we obtain query $Q_1$:

$$\left\{ \begin{array}{l} \bigwedge_{1 \leq i \leq k} \mathrm{closed}(X_i) \;\wedge \\ \mathrm{coverTransactions}([X_1, \ldots, X_k]) \;\wedge \\ \bigwedge_{1 \leq i < j \leq k} \mathrm{overlapTransactions}(X_i, X_j) = 0 \;\wedge \\ \mathrm{canonical}([X_1, \ldots, X_k]) \end{array} \right.$$

Following our running example, query $Q_1$ leads to only 5 solutions since $5 \times 3! = 30$
(see Table 3).

The constraint $\mathrm{canonical}([X_1, \ldots, X_k])$ plays an important role. First, as the
number of symmetrical solutions ($k!$) grows very rapidly with the number $k$ of clusters, it
quickly becomes very large. So, it is essential to break the
symmetries to avoid having a huge number of redundant solutions. Moreover,
this constraint performs an efficient filtering by drastically reducing the size
of the search space.

Table 3. Set of different clusterings.
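The effect of symmetry breaking can be checked on a small example. The sketch below reuses a naive enumeration on a hypothetical toy dataset (not Table 1) and assumes that the lexicographic order compares sorted item lists. Each clustering found by $Q_0$ appears $k!$ times, and adding the canonical constraint keeps one representative of each.

```python
from itertools import combinations, product
from math import factorial

# Hypothetical toy dataset (NOT Table 1 of the paper).
T = [frozenset(t) for t in ("ABK", "ABM", "CDGM", "CDGM", "EFG", "EFH", "EFH")]
ITEMS = sorted(set().union(*T))
ALL_TRANSACTIONS = frozenset(range(len(T)))

def cover(x):
    return frozenset(i for i, t in enumerate(T) if x <= t)

def closed(x):
    covered = [T[i] for i in cover(x)]
    return bool(covered) and x == frozenset.intersection(*covered)

CLOSED = [frozenset(c) for r in range(1, len(ITEMS) + 1)
          for c in combinations(ITEMS, r) if closed(frozenset(c))]

def q0(xs):
    covers = [cover(x) for x in xs]
    return (frozenset().union(*covers) == ALL_TRANSACTIONS
            and all(not (a & b) for a, b in combinations(covers, 2)))

def canonical(xs):
    """Assumed lexicographic order: compare the sorted item lists."""
    keys = [sorted(x) for x in xs]
    return all(a < b for a, b in zip(keys, keys[1:]))

k = 3
q0_solutions = [xs for xs in product(CLOSED, repeat=k) if q0(xs)]
q1_solutions = [xs for xs in q0_solutions if canonical(xs)]

print(len(q0_solutions), "solutions for Q0")
print(len(q1_solutions), "solutions for Q1")
for s in q1_solutions:
    print(["".join(sorted(x)) for x in s])
```

Here the canonical test is applied as a post-filter for readability; in a CP setting the constraint would instead prune the search, as stated above.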



ii) Removing solutions with non-frequent patterns. A clustering containing
at least one pattern having a low frequency is not considered to be relevant.
To remove such solutions, we only need to add new constraints to the current
query $Q_1$. Such a constraint requires that each cluster must have a frequency
greater than a threshold (here 10% of $m = 11$, i.e. 1.1, so a frequency of at least 2):

$$\forall\, 1 \leq i \leq k, \quad \mathrm{freq}(X_i) \geq 2$$

From query $Q_1$, we obtain query $Q_2$:

$$\left\{ \begin{array}{l} \bigwedge_{1 \leq i \leq k} \mathrm{closed}(X_i) \;\wedge \\ \mathrm{coverTransactions}([X_1, \ldots, X_k]) \;\wedge \\ \bigwedge_{1 \leq i < j \leq k} \mathrm{overlapTransactions}(X_i, X_j) = 0 \;\wedge \\ \mathrm{canonical}([X_1, \ldots, X_k]) \;\wedge \\ \bigwedge_{1 \leq i \leq k} \mathrm{freq}(X_i) \geq 2 \end{array} \right.$$

Pattern {C,F,G,H} of solution $s_1$ (see Table 2) has a frequency of 1, which
is less than the threshold. So, for $Q_2$, solution $s_1$ is not valid. For query $Q_2$, there
remain 4 solutions: $s_7$, $s_{13}$, $s_{19}$, and $s_{25}$ (see Table 3).



iii) Removing solutions with small size patterns. A clustering containing
at least one pattern of size 1 is not considered to be relevant^5. To remove such
clusterings, we only need to add new constraints to the current query $Q_2$. Such
a constraint requires that each cluster must have a size greater than 1. This can
be achieved by stating, for each cluster, a constraint restricting its size:

$$\forall\, 1 \leq i \leq k, \quad \mathrm{size}(X_i) \geq 2$$

^5 Usually, clusterings using these unitary clusters reflect the discretisation of some
attributes.

From query $Q_2$, we obtain query $Q_3$:

$$\left\{ \begin{array}{l} \bigwedge_{1 \leq i \leq k} \mathrm{closed}(X_i) \;\wedge \\ \mathrm{coverTransactions}([X_1, \ldots, X_k]) \;\wedge \\ \bigwedge_{1 \leq i < j \leq k} \mathrm{overlapTransactions}(X_i, X_j) = 0 \;\wedge \\ \mathrm{canonical}([X_1, \ldots, X_k]) \;\wedge \\ \bigwedge_{1 \leq i \leq k} \mathrm{freq}(X_i) \geq 2 \;\wedge \\ \bigwedge_{1 \leq i \leq k} \mathrm{size}(X_i) \geq 2 \end{array} \right.$$

Query $Q_3$ has only 1 solution: $s_7$ (see Table 3). For this solution, we have
$X_1 = \{A, F\}$, $X_2 = \{C, H\}$ and $X_3 = \{E, G\}$.
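The refinement chain $Q_1 \rightarrow Q_2 \rightarrow Q_3$ amounts to adding conjuncts to an existing query. The sketch below expresses this with a hypothetical helper `conj` that combines predicates. It runs on a toy dataset that is not Table 1, so the solution counts are those of the toy data only, and the naive enumeration stands in for a CP solver.

```python
from itertools import combinations, product

# Hypothetical toy dataset (NOT Table 1 of the paper).
T = [frozenset(t) for t in ("ABK", "ABM", "CDGM", "CDGM", "EFG", "EFH", "EFH")]
ITEMS = sorted(set().union(*T))
ALL_TRANSACTIONS = frozenset(range(len(T)))

def cover(x):
    return frozenset(i for i, t in enumerate(T) if x <= t)

def freq(x):
    return len(cover(x))

def closed(x):
    covered = [T[i] for i in cover(x)]
    return bool(covered) and x == frozenset.intersection(*covered)

CLOSED = [frozenset(c) for r in range(1, len(ITEMS) + 1)
          for c in combinations(ITEMS, r) if closed(frozenset(c))]

def cover_transactions(xs):
    return frozenset().union(*(cover(x) for x in xs)) == ALL_TRANSACTIONS

def no_overlap(xs):
    return all(not (cover(a) & cover(b)) for a, b in combinations(xs, 2))

def canonical(xs):
    keys = [sorted(x) for x in xs]
    return all(a < b for a, b in zip(keys, keys[1:]))

def conj(*constraints):
    """Conjunction of constraints: the refined query adds conjuncts to the old one."""
    return lambda xs: all(c(xs) for c in constraints)

def solve(query, k=3):
    return [xs for xs in product(CLOSED, repeat=k) if query(xs)]

q1 = conj(cover_transactions, no_overlap, canonical)
q2 = conj(q1, lambda xs: all(freq(x) >= 2 for x in xs))   # frequent clusters only
q3 = conj(q2, lambda xs: all(len(x) >= 2 for x in xs))    # no unitary patterns

for name, query in (("Q1", q1), ("Q2", q2), ("Q3", q3)):
    print(name, ["/".join("".join(sorted(x)) for x in s) for s in solve(query)])
```

Each refinement reuses the previous query unchanged, which mirrors the way $Q_2$ and $Q_3$ are obtained above by only appending constraints.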



3.3 Solving other Clustering Problems 

In the same way, it is easy to express other clustering problems such as co- 
clustering, soft clustering and soft co-clustering. 

i) The soft clustering problem is a relaxed version of the clustering problem
where small overlaps (at most $\delta_T$) on transactions are authorized. This problem
is modelled by query $Q_4$ (soft version of $Q_0$):

$$\left\{ \begin{array}{l} \bigwedge_{1 \leq i \leq k} \mathrm{closed}(X_i) \;\wedge \\ \mathrm{coverTransactions}([X_1, \ldots, X_k]) \;\wedge \\ \bigwedge_{1 \leq i < j \leq k} \mathrm{overlapTransactions}(X_i, X_j) \leq \delta_T \;\wedge \\ \mathrm{canonical}([X_1, \ldots, X_k]) \end{array} \right.$$

Consider query $Q_4$ with $k = 3$ and a maximal overlap for transactions $\delta_T = 1$.
There are 13 solutions (see Table 4). If symmetries are not broken using the
constraint $\mathrm{canonical}([X_1, \ldots, X_k])$, then there are 78 ($3! \times 13$) solutions.

For solution $s'_1$, patterns $X_1$ and $X_3$ cover transaction $t_{11}$ (see Table 1).
Moreover, patterns $X_2$ and $X_3$ cover transaction $t_2$ (see Table 1). After having
removed solutions with non-frequent patterns, there remain 8 solutions: from
$s'_6$ to $s'_{13}$. After having removed solutions with small size patterns, there remains
only 1 solution: s'g (which is the solution $s_7$ of the initial clustering problem, see
Section 3.1).
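A minimal sketch of the soft clustering query, again by naive enumeration on a hypothetical toy dataset (not Table 1): the only change with respect to the strict query is that the overlap test uses a bound `delta_t` instead of 0. The function name `q4` and the use of "at most `delta_t`" follow the reading given above and are otherwise illustrative.

```python
from itertools import combinations, product

# Hypothetical toy dataset (NOT Table 1 of the paper).
T = [frozenset(t) for t in ("ABK", "ABM", "CDGM", "CDGM", "EFG", "EFH", "EFH")]
ITEMS = sorted(set().union(*T))
ALL_TRANSACTIONS = frozenset(range(len(T)))

def cover(x):
    return frozenset(i for i, t in enumerate(T) if x <= t)

def closed(x):
    covered = [T[i] for i in cover(x)]
    return bool(covered) and x == frozenset.intersection(*covered)

CLOSED = [frozenset(c) for r in range(1, len(ITEMS) + 1)
          for c in combinations(ITEMS, r) if closed(frozenset(c))]

def canonical(xs):
    keys = [sorted(x) for x in xs]
    return all(a < b for a, b in zip(keys, keys[1:]))

def q4(delta_t):
    """Soft clustering: pairwise transaction overlap of at most delta_t."""
    def query(xs):
        covers = [cover(x) for x in xs]
        return (frozenset().union(*covers) == ALL_TRANSACTIONS
                and all(len(a & b) <= delta_t for a, b in combinations(covers, 2))
                and canonical(xs))
    return query

def solve(query, k=3):
    return [xs for xs in product(CLOSED, repeat=k) if query(xs)]

strict = solve(q4(0))   # delta_t = 0 gives back the strict clustering query
soft = solve(q4(1))     # one shared transaction allowed per pair of clusters
print(len(strict), "strict clusterings,", len(soft), "soft clusterings")
for s in soft:
    print(["".join(sorted(x)) for x in s])
```

Every strict clustering is also a soft one, so relaxing the bound can only add solutions.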



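ii) The co-clustering problem consists in finding $k$ clusters covering both
the set of transactions and the set of items, without any overlap on transactions
or on items. This problem is modelled by query $Q_5$:

$$\left\{ \begin{array}{l} \bigwedge_{1 \leq i \leq k} \mathrm{closed}(X_i) \;\wedge \\ \mathrm{coverTransactions}([X_1, \ldots, X_k]) \;\wedge \\ \bigwedge_{1 \leq i < j \leq k} \mathrm{overlapTransactions}(X_i, X_j) = 0 \;\wedge \\ \mathrm{coverItems}([X_1, \ldots, X_k]) \;\wedge \\ \bigwedge_{1 \leq i < j \leq k} \mathrm{overlapItems}(X_i, X_j) = 0 \;\wedge \\ \mathrm{canonical}([X_1, \ldots, X_k]) \end{array} \right.$$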
Table 4. Set of different clusterings for query Q4 (soft clustering).



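iii) The soft co-clustering problem is a relaxed version of the co-clustering
problem, allowing small overlaps on transactions (at most $\delta_T$) and on items
(at most $\delta_I$). This problem is modelled by query $Q_6$ (soft version of $Q_4$ and
$Q_5$):

$$\left\{ \begin{array}{l} \bigwedge_{1 \leq i \leq k} \mathrm{closed}(X_i) \;\wedge \\ \mathrm{coverTransactions}([X_1, \ldots, X_k]) \;\wedge \\ \bigwedge_{1 \leq i < j \leq k} \mathrm{overlapTransactions}(X_i, X_j) \leq \delta_T \;\wedge \\ \mathrm{coverItems}([X_1, \ldots, X_k]) \;\wedge \\ \bigwedge_{1 \leq i < j \leq k} \mathrm{overlapItems}(X_i, X_j) \leq \delta_I \;\wedge \\ \mathrm{canonical}([X_1, \ldots, X_k]) \end{array} \right.$$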
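The co-clustering queries can be sketched in the same generate-and-test style. The toy dataset below is hypothetical and deliberately chosen so that the strict co-clustering query has no solution while the soft version, allowing one shared item, has one. The names `q6` and `q5`, and the reading of the bounds as "at most", are assumptions for illustration.

```python
from itertools import combinations, product

# Hypothetical toy dataset (NOT Table 1 of the paper): item E bridges two groups.
T = [frozenset(t) for t in ("AB", "AB", "CDE", "CDE", "EF", "EF")]
ITEMS = sorted(set().union(*T))
ALL_TRANSACTIONS = frozenset(range(len(T)))

def cover(x):
    return frozenset(i for i, t in enumerate(T) if x <= t)

def closed(x):
    covered = [T[i] for i in cover(x)]
    return bool(covered) and x == frozenset.intersection(*covered)

CLOSED = [frozenset(c) for r in range(1, len(ITEMS) + 1)
          for c in combinations(ITEMS, r) if closed(frozenset(c))]

def canonical(xs):
    keys = [sorted(x) for x in xs]
    return all(a < b for a, b in zip(keys, keys[1:]))

def q6(delta_t, delta_i):
    """Soft co-clustering with bounded overlaps on transactions and on items."""
    def query(xs):
        covers = [cover(x) for x in xs]
        return (frozenset().union(*covers) == ALL_TRANSACTIONS
                and all(len(a & b) <= delta_t for a, b in combinations(covers, 2))
                and frozenset().union(*xs) == frozenset(ITEMS)
                and all(len(a & b) <= delta_i for a, b in combinations(xs, 2))
                and canonical(xs))
    return query

q5 = q6(0, 0)  # strict co-clustering is the special case with both bounds at 0

def solve(query, k=3):
    return [xs for xs in product(CLOSED, repeat=k) if query(xs)]

print("Q5:", solve(q5))
print("Q6:", [["".join(sorted(x)) for x in s] for s in solve(q6(0, 1))])
```

Writing $Q_5$ as `q6(0, 0)` reflects that the strict query is recovered from the soft one by setting both overlap bounds to zero.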



4 Conclusions and Future Works 



We have proposed a constraint-based language that makes it easy to express different
mining tasks in a declarative way. Thanks to the declarative process, extending
or changing the specification to refine the results and get more relevant patterns
or address new global patterns is very simple. Moreover, all constraints can be
combined together and new constraints can be added.

The effectiveness and the flexibility of our approach are shown on several examples
coming from clustering based on associations: thanks to query refinements,
the data analyst is able to produce clusterings satisfying different constraints,
thus generating more meaningful clusters and avoiding outlier ones.

As future work, we intend to enrich our constraint-based language with fur- 
ther constraints to capture and model a wide range of data mining tasks. The 
scalability of the approach to larger values of k and larger datasets can also be 
investigated. Another promising direction is to integrate optimisation criteria in 
our framework. 